feat: re-land graceful scale-down for AlluxioRuntime with e2e test#6061
feat: re-land graceful scale-down for AlluxioRuntime with e2e test#6061jakharmonika364 wants to merge 6 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @jakharmonika364. Thanks for your PR. I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Code Review
This pull request introduces a new end-to-end test suite for testing the graceful scale-down of AlluxioRuntime, including the test script and associated Kubernetes resource manifests. It also exposes the feature gate flags in the Alluxio controller command. The review feedback highlights a critical missing blank import required to register the feature gates, and suggests improvements to the E2E test script, such as simplifying the wait-loop logging logic and properly scoping variables as local to prevent global namespace pollution.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Fixed the wait_dataset_bound and wait_job_completed issues. The blank import suggestion isn't needed though - pkg/ddc/alluxio already imports pkg/features directly (replicas.go), and cmd/alluxio/app/alluxio.go imports pkg/ddc/alluxio, so the init() already runs. Verified --feature-gates works via go run ./cmd/alluxio start --help. |
656ceff to
5860b3b
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6061 +/- ##
==========================================
+ Coverage 64.77% 64.89% +0.11%
==========================================
Files 484 486 +2
Lines 33892 34058 +166
==========================================
+ Hits 21954 22101 +147
- Misses 10215 10229 +14
- Partials 1723 1728 +5 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Re-lands the AlluxioRuntime graceful worker scale-down feature by adding a new Kind-based GitHub Actions e2e scenario and exposing the --feature-gates flag on the alluxioruntime-controller binary so the gate can be enabled in real deployments.
Changes:
- Add a new
test/gha-e2e/alluxio-scaledown/e2e scenario that scales an AlluxioRuntime from 2→1 and verifies data access before/after. - Expose feature gates on
alluxioruntime-controllerviaDefaultMutableFeatureGate.AddFlag(...). - Wire the new e2e scenario into
.github/scripts/gha-e2e.shso it runs in CI.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/gha-e2e/alluxio-scaledown/test.sh | New e2e driver script that enables the gate, creates runtime, scales down, and re-reads data |
| test/gha-e2e/alluxio-scaledown/read_before_job.yaml | Job used to validate data access before scale-down |
| test/gha-e2e/alluxio-scaledown/read_after_job.yaml | Job used to validate data access after scale-down |
| test/gha-e2e/alluxio-scaledown/dataset.yaml | Dataset + 2-replica AlluxioRuntime manifest for the scale-down scenario |
| cmd/alluxio/app/alluxio.go | Adds --feature-gates flag exposure for the alluxio controller binary |
| .github/scripts/gha-e2e.sh | Runs the new alluxio scale-down e2e scenario in the CI e2e sequence |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…dvancedStatefulSet (fluid-cloudnative#4193) (fluid-cloudnative#5805)" (fluid-cloudnative#6059) This reverts commit 333e13a. Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
Adds a real-cluster scenario exercising the scale-down -> drain -> decommission -> data-integrity path: bind a 2-replica AlluxioRuntime, read data, scale down to 1 replica, wait for the worker StatefulSet to actually converge (proving the drain doesn't stall), then read again. Also wires --feature-gates into the alluxioruntime-controller binary (cmd/alluxio/app/alluxio.go), mirroring cmd/csi and cmd/fluidapp - without it GracefulWorkerScaleDown was registered but unreachable on a real deployment, so the e2e test enables it via a deployment patch. Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
Simplify wait_dataset_bound's elapsed-time tracking to a single counter (matches wait_worker_replicas), and declare succeed/job_failed as local in wait_job_completed to avoid polluting global scope. Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
- check kubectl patch exit codes instead of letting failures surface as confusing downstream timeouts - log the WorkerDecommissioning condition while polling so a stuck drain is debuggable from CI output - replace the external apache.org mirror mount with an in-cluster MinIO bucket seeded with one known object, and assert its exact content before/after scale-down instead of just checking the directory is non-empty - the previous assertion would pass even if decommission silently dropped cached data instead of migrating it Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
c3028aa to
af5935d
Compare
|
Found why CI was failing: this branch forked from master before #6059's revert merged, so the PR's merge-ref was silently dropping the whole feature (pkg/features, replicas.go logic, etc.) since my commits never touched those files directly. Rebuilt the branch off current master with the revert reverted, then replayed the e2e/flag-fix commits on top - --feature-gates=GracefulWorkerScaleDown=true now actually registers. |
| - mountPoint: s3://scaledown/ | ||
| name: scaledown | ||
| options: | ||
| alluxio.underfs.s3.endpoint: "http://scaledown-minio:9000" |
There was a problem hiding this comment.
Leaving this as http - it's an in-cluster only MinIO with throwaway test creds, same as curvine's existing endpoint_url: http://minio:9000 fixture. No real traffic to protect here.
Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
Alluxio mounts s3://scaledown/ at /scaledown inside its namespace (named after the Dataset mount's name), not at the FUSE root. The existing alluxio test never noticed this because its assertion just checks "ls /data" is non-empty, which passes either way. Use the correct /data/scaledown/fixture.txt path so the content check actually exercises the fixture. Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
|



Re-lands #5805, reverted in #6059 because it had no e2e coverage of the actual scale-down -> drain -> decommission -> data-integrity path.
This PR adds that e2e test under test/gha-e2e/alluxio-scaledown/: binds a 2-replica AlluxioRuntime, reads data, scales down to 1 replica, waits for the worker StatefulSet to actually converge, then reads again.
While wiring this up I found GracefulWorkerScaleDown was registered as a feature gate but never exposed as a flag on the alluxioruntime-controller binary (cmd/alluxio/app/alluxio.go), unlike csi and fluidapp, so it was unreachable in any real deployment. Fixed that too - the e2e test enables it via a deployment patch since there's no Helm value for it yet.
Original feature code is unchanged from #5805.